EDA and feature engineering

Baselines

In assessing the categorical features, having diabetes and being a smoker appear to be associated with an increased risk of intubation or death. The use of immunosuppressants also seems to be associated with increased risk, but very few patients appear to be taking an immunosuppressant. The continuous variables also seem to have different distributions between the two groups. The ages of the patients who experience an event are much lower than those of patients who do not. Similarly, the duration of symptoms is longer on average for patients without an event. Finally, the BMIs are on average higher for the patients without an event.

## Continuous variables for patients with an event:
                 age        bmi   duration_symptoms
Min.        5.408467   9.861328            1.000000
1st Qu.    46.445732  22.435642            5.000000
Median     59.334673  25.998552            9.000000
Mean       58.638100  26.002439            8.212403
3rd Qu.    70.614480  29.197509           10.000000
Max.       95.083451  50.830828           27.000000
## Continuous variables for patients without an event:
                 age        bmi   duration_symptoms
Min.        16.08475   15.89455            1.000000
1st Qu.     62.04439   24.93572            6.000000
Median      73.54337   28.51278            9.000000
Mean        71.78282   29.52817            9.530758
3rd Qu.     82.09736   32.64023           12.000000
Max.       113.67434   58.90469           35.000000

Lab data

To clean the data, I calculated the timepoint in hours relative to the initial reading for all labs and for all subjects. I also linearly imputed the missing data using the imputeTS package. Originally, I intended to use the ARIMA imputation method given the autoregressive nature of the data, but the long runs of consecutive NA values caused it to produce physiologically impossible imputed values.
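The cleaning step above was done in R with imputeTS; as an illustrative sketch of the same two operations (relative-hour timepoints and linear imputation), the equivalent in Python with pandas, using hypothetical toy data, might look like:

```python
import pandas as pd

# Hypothetical toy lab data: one subject, charted at irregular clock
# times, with missing (NA) readings in the middle.
labs = pd.DataFrame({
    "subject_id": [1, 1, 1, 1],
    "charttime": pd.to_datetime(
        ["2020-04-01 06:00", "2020-04-01 10:00",
         "2020-04-01 14:00", "2020-04-01 18:00"]),
    "heart_rate": [88.0, None, None, 76.0],
})

# Timepoint in hours relative to each subject's initial reading.
labs["hours"] = (
    labs.groupby("subject_id")["charttime"]
        .transform(lambda t: (t - t.iloc[0]).dt.total_seconds() / 3600))

# Linear imputation of missing values within each subject, analogous
# to imputeTS::na_interpolation(option = "linear") in R.
labs["heart_rate"] = (
    labs.groupby("subject_id")["heart_rate"]
        .transform(lambda s: s.interpolate(method="linear")))
print(labs[["hours", "heart_rate"]])
```

The grouping by subject ensures that imputed values never bleed across patients.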

To engineer the features, I first added summary statistics of the lab values, including the mean, median, minimum, and maximum. Initially, I had intended to incorporate some trend features; however, the measures demonstrated autoregressive properties, appearing to be cyclical over the course of the day but more or less stationary. I also added measures of the distribution characteristics, including skew and kurtosis, for each variable.
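As a sketch of this per-subject summarization (shown in Python with hypothetical toy readings; the report's analysis was done in R), each repeated lab series collapses to one row of location, spread, and shape statistics:

```python
import pandas as pd
from scipy.stats import kurtosis, skew

# Hypothetical repeated lab readings for two subjects.
labs = pd.DataFrame({
    "subject_id": [1] * 5 + [2] * 5,
    "heart_rate": [80, 85, 90, 95, 100, 70, 70, 72, 74, 120],
})

# One row per subject: mean, median, min, and max summarize location
# and range; skew and kurtosis capture distribution shape.
features = labs.groupby("subject_id")["heart_rate"].agg(
    mean="mean", median="median", min="min", max="max",
    skew=skew, kurtosis=kurtosis)
print(features)
```

Subject 1's symmetric, evenly spaced readings yield a skew of zero, while subject 2's single spike to 120 shows up as strong positive skew.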

Finally, I compared the lab values that were provided to published literature to see if any are recognized as strong predictors of intubation in COVID-19 patients. One publication cited oxygen saturation (SpO2) < 90% and respiratory rate > 24 breaths/min as key features of the presentation of patients who required intubation.1 The WHO guidelines classify severe pneumonia in COVID-19 patients as SpO2 < 93%, respiratory rate > 30 breaths/min, or severe respiratory distress.2 As such, I created features that represented the percentage of readings where the respiratory rate was greater than 24, 26, 28, 30, 32, and 34 breaths/min and the oxygen saturation was less than 85%, 87%, 90%, and 93%.
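These literature-based threshold features reduce to fractions of readings beyond each cutoff. A minimal sketch (Python, with hypothetical vitals; the feature names mirror the resp_* and spo2_* columns in the importance tables below):

```python
import pandas as pd

# Hypothetical repeated vitals for two subjects.
vitals = pd.DataFrame({
    "subject_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "resp_rate": [22, 30, 31, 35, 18, 20, 22, 24],
    "spo2":      [96, 91, 89, 86, 98, 97, 95, 94],
})

def threshold_features(df):
    """Fraction of a subject's readings beyond each literature cutoff."""
    out = {}
    for cut in (24, 26, 28, 30, 32, 34):
        out[f"resp_{cut}"] = (df["resp_rate"] > cut).mean()
    for cut in (85, 87, 90, 93):
        out[f"spo2_{cut}"] = (df["spo2"] < cut).mean()
    return pd.Series(out)

features = vitals.groupby("subject_id").apply(threshold_features)
print(features)
```

Using the fraction of readings (rather than a single flag) preserves how persistently a patient crossed each threshold.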

Event  Lab measure   Minimum     Mean        Median      Maximum
No     diastolic      48.47460    60.15270    60.15916    69.58681
Yes    diastolic      49.45293    59.78746    59.79267    69.92931
No     heart_rate     65.11754    75.28557    75.26942    84.53421
Yes    heart_rate     65.29784    74.65946    74.62560    85.26364
No     resp_rate      19.83511    30.01062    30.02795    39.95080
Yes    resp_rate      20.65072    29.95777    29.93250    40.62065
No     spo2           83.26979    92.44655    92.40854   103.20654
Yes    spo2           82.40515    92.51879    92.51551   102.09935
No     systolic      120.17357   130.10846   130.12741   139.71925
Yes    systolic      119.22616   129.86097   129.84965   139.43613

Tree-based methods

Given the number and diversity of types of features, I first tried tree-based methods optimized for performance with gradient boosting and AdaBoost, because tree-based models can partition the feature space in a non-parametric manner. Gradient boosting and AdaBoost take an otherwise weak learner, a decision tree, and improve performance by learning from previous trees in a manner that reduces error. Additionally, while these are both “black box” methods, they can provide relative feature importance, which makes them useful for both inference and prediction.

For both methods, I first selected the tuning parameters of interaction depth, number of trees, and shrinkage using grid search and 5-fold cross-validation. Then I fit the model on the training data and assessed performance predominantly on the testing error (i.e., generalization error).
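The original tuning was done in R; an illustrative sketch of the same procedure in Python/scikit-learn (with a synthetic stand-in for the engineered feature matrix, and a deliberately small grid) would be:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered feature matrix and event labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scikit-learn analogues of the gbm tuning parameters:
# learning_rate ~ shrinkage, n_estimators ~ number of trees,
# max_depth ~ interaction depth.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.009, 0.1],
                "n_estimators": [50, 100],
                "max_depth": [1, 2]},
    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 3))
```

The winning combination is refit on the full training split, and the held-out test accuracy serves as the generalization estimate.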

Gradient boosted

## The optimal shrinkage parameter is 0.009 out of the range of 0 - 0.01 tested.
## The optimal number of trees is 5000 out of the range of 1000 - 5000 tested.
## The optimal interaction depth is 2 out of the range of 1 - 2 tested.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 137  29
##          1  46 123
##                                           
##                Accuracy : 0.7761          
##                  95% CI : (0.7276, 0.8196)
##     No Information Rate : 0.5463          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5526          
##                                           
##  Mcnemar's Test P-Value : 0.06467         
##                                           
##             Sensitivity : 0.7486          
##             Specificity : 0.8092          
##          Pos Pred Value : 0.8253          
##          Neg Pred Value : 0.7278          
##              Prevalence : 0.5463          
##          Detection Rate : 0.4090          
##    Detection Prevalence : 0.4955          
##       Balanced Accuracy : 0.7789          
##                                           
##        'Positive' Class : 0               
## 

AdaBoost

## The optimal shrinkage parameter is 0.009 out of the range of 0 - 0.01 tested.
## The optimal number of trees is 5000 out of the range of 1000 - 5000 tested.
## The optimal interaction depth is 2 out of the range of 1 - 2 tested.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 135  31
##          1  48 121
##                                          
##                Accuracy : 0.7642         
##                  95% CI : (0.715, 0.8086)
##     No Information Rate : 0.5463         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.5287         
##                                          
##  Mcnemar's Test P-Value : 0.07184        
##                                          
##             Sensitivity : 0.7377         
##             Specificity : 0.7961         
##          Pos Pred Value : 0.8133         
##          Neg Pred Value : 0.7160         
##              Prevalence : 0.5463         
##          Detection Rate : 0.4030         
##    Detection Prevalence : 0.4955         
##       Balanced Accuracy : 0.7669         
##                                          
##        'Positive' Class : 0              
## 

Comparison of tree-based methods

Both gradient boosting and AdaBoost allow for assessment of feature importance. Using both methodologies, the order of feature importance was roughly similar. Overall, age and BMI were highly important, followed by various lab measures and duration of symptoms. In general, the continuous variables had a greater influence on the outcome than the categorical features. Of the categorical features, cancer, smoking status, and diabetes ranked among the highest. Of the features I engineered based on the literature, as suggested by the histograms, SpO2 < 90% and respiratory rate > 28 breaths/min were the most important.

To compare which method was preferable, I assessed the ROC curve AUC for the training and testing data. The performance was incredibly similar, with a ROC AUC of 0.9 for both models. Since the training ROC AUC was also quite similar, it was challenging to assess whether one model was more “overfit” than the other, i.e., demonstrating higher variance. Therefore, I also assessed the variance of the models by looking at the standard error of the cross-validation error from the model training output. By this measure, the AdaBoost model appeared to have less variance, and thus I would select this method as preferable.
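The train-versus-test AUC comparison described above can be sketched as follows (Python/scikit-learn for illustration, on hypothetical synthetic data; a large train-test gap would flag overfitting):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("gbm", GradientBoostingClassifier(random_state=0)),
                    ("adaboost", AdaBoostClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    # Compare train vs. test AUC; similar values suggest little overfitting.
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: train AUC {train_auc:.3f}, test AUC {test_auc:.3f}")
```

When the AUCs tie, the cross-validation standard error becomes the tiebreaker, as in the comparison above.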

Feature                      Importance (GBM)   Importance (AdaBoost)
1 age 14.6929686 14.3372478
3 bmi 11.7320259 11.5205599
24 heart_rate.mean 8.2223480 7.8402792
25 heart_rate.median 5.6659504 5.7081920
12 diastolic.max 5.3589183 5.6541987
13 diastolic.mean 3.9473109 3.5871579
17 duration_symptoms 2.6966765 3.0273976
14 diastolic.median 2.6057753 2.6351449
59 systolic.median 2.5352015 2.4616747
61 systolic.skew 2.4907757 2.4649863
26 heart_rate.min 2.2999222 2.3734553
57 systolic.max 2.2231788 2.2615455
38 resp_rate.kurtosis 2.0459262 2.0074059
58 systolic.mean 1.8950444 1.8855952
56 systolic.kurtosis 1.8031273 1.9575278
47 spo2.max 1.5831948 1.8680507
46 spo2.kurtosis 1.5452475 1.4677714
54 spo2_90 1.5196014 1.4480661
50 spo2.min 1.4013477 1.6473623
60 systolic.min 1.3959180 1.5195242
11 diastolic.kurtosis 1.3931699 1.2962214
16 diastolic.skew 1.3906031 1.3003317
23 heart_rate.max 1.3882695 1.3152081
43 resp_rate.skew 1.3139206 1.2322535
15 diastolic.min 1.2632282 1.1223651
27 heart_rate.skew 1.2611701 1.1502193
42 resp_rate.min 1.2047795 1.3334680
34 resp_28 1.1334608 1.3601187
55 spo2_93 1.0353126 0.9139220
22 heart_rate.kurtosis 0.9600189 1.1151746
5 cancer 0.9393139 0.9896885
36 resp_32 0.9343096 1.0028440
39 resp_rate.max 0.9085980 0.9886775
45 smoke_vape 0.7660981 0.8229905
51 spo2.skew 0.7543462 0.7602181
33 resp_26 0.7070544 0.7146722
41 resp_rate.median 0.6614654 0.6690283
35 resp_30 0.5698405 0.4764221
49 spo2.median 0.4914154 0.4814679
48 spo2.mean 0.4297872 0.4823535
37 resp_34 0.4064184 0.3971066
40 resp_rate.mean 0.3800149 0.3988914
53 spo2_87 0.3381063 0.2927159
9 diabetes 0.3206089 0.3017999
19 ed_before_order_set 0.2435553 0.2151431
64 xray_pleural_effusion 0.2008500 0.2464926
21 fever 0.1972144 0.2178333
31 nausea_vomit 0.1456769 0.1560759
63 xray_clear 0.1354004 0.1772019
32 resp_24 0.0755645 0.0693941
44 sex 0.0627708 0.0594268
30 myalgias 0.0572895 0.0428114
52 spo2_85 0.0406701 0.0340323
28 hypertension 0.0352648 0.0514657
8 cough 0.0352396 0.0329172
10 diarrhea 0.0351415 0.0216003
65 xray_unilateral_infiltrate 0.0260030 0.0142644
29 hypoxia 0.0238360 0.0151726
62 xray_bilateral_infiltrates 0.0215882 0.0088282
20 esrd 0.0128579 0.0198408
7 copd 0.0105663 0.0088362
18 dyspnea 0.0101991 0.0093264
2 any_immunosuppression 0.0092275 0.0051901
6 ckd 0.0060096 0.0028444
4 cad 0.0033050 0.0000000

## The standard error of cross-validation error with gradient boosting is 0.997%
## The standard error of cross-validation error with AdaBoost is 0.900%

Support vector machine

Then, I wanted to test whether another methodology would improve upon the performance of a tree-based model. I selected another “black box” model: a support vector machine with a radial kernel. Because the radial kernel implicitly projects the data into an infinite-dimensional space, the SVM can find separating hyperplanes for data that are otherwise not strictly separable. However, interpretability is diminished with this approach. We can, however, output relative feature importance, just as we did with the tree-based models.

To build this model, I first scaled the features and then selected the tuning parameters C and \(\sigma\) using grid search and 5-fold cross-validation. Based on the cross-validation accuracy, the model with a \(\sigma\) of 0.1 suffered from overfitting and performed very poorly, consistent with expected results from higher values of this tuning parameter. Overall, the cross-validation accuracy was highest for the model fit with a \(\sigma\) of 0.001. This is also apparent when comparing the ROC curves for the training and testing error, where the test performance is best with a \(\sigma\) of 0.001 despite having the lowest training AUC.
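A sketch of this scale-then-tune pipeline (Python/scikit-learn on synthetic data for illustration; scikit-learn's gamma plays the role of kernlab's \(\sigma\), with larger values giving a wigglier, more overfit-prone boundary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the engineered feature matrix.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scaling inside the pipeline keeps the scaler fit only on each
# training fold, avoiding leakage during cross-validation.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe,
                    param_grid={"svc__C": [0.1, 1, 10],
                                "svc__gamma": [0.001, 0.01, 0.1]},
                    cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```

Putting the scaler inside the cross-validated pipeline, rather than scaling up front, is the main design point of this sketch.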

The feature importance is roughly similar to those resulting from the tree-based methods. Again, the continuous variables appeared to be the most important features, with age, heart rate, and BMI as the top 3 features. Of the categorical features, diabetes, admitted to ED before lab order set change, fever, myalgias, and smoking status were the most important.

Feature                     Importance (No)  Importance (Yes)
age 100.0000000 100.0000000
heart_rate.mean 76.7510305 76.7510305
heart_rate.median 72.0219731 72.0219731
bmi 63.0554928 63.0554928
heart_rate.min 50.8030869 50.8030869
heart_rate.max 39.6342840 39.6342840
diastolic.max 36.5707645 36.5707645
diastolic.mean 36.3065526 36.3065526
diastolic.median 35.0691048 35.0691048
systolic.median 30.6494093 30.6494093
duration_symptoms 27.0817134 27.0817134
systolic.mean 26.9086379 26.9086379
systolic.max 21.6545012 21.6545012
diabetes 18.1687444 18.1687444
systolic.skew 17.4455063 17.4455063
ed_before_order_set 15.2306419 15.2306419
systolic.kurtosis 14.0308191 14.0308191
heart_rate.skew 13.4873454 13.4873454
spo2_93 11.4564259 11.4564259
fever 10.9497412 10.9497412
myalgias 10.7499101 10.7499101
spo2.kurtosis 10.4254981 10.4254981
smoke_vape 10.2984089 10.2984089
spo2.mean 9.0241720 9.0241720
xray_bilateral_infiltrates 8.8795244 8.8795244
spo2_90 8.4648122 8.4648122
resp_rate.skew 8.3703314 8.3703314
cancer 8.3661508 8.3661508
diastolic.min 8.1094640 8.1094640
spo2.median 7.5960903 7.5960903
dyspnea 7.0894056 7.0894056
spo2.min 6.4489427 6.4489427
resp_26 6.2399144 6.2399144
systolic.min 6.2014532 6.2014532
resp_rate.kurtosis 5.9790470 5.9790470
esrd 5.8469411 5.8469411
sex 5.6838990 5.6838990
spo2_87 5.5225291 5.5225291
diarrhea 5.4974457 5.4974457
resp_30 5.1094891 5.1094891
diastolic.kurtosis 3.9723748 3.9723748
cad 3.8628440 3.8628440
resp_rate.max 3.7215408 3.7215408
any_immunosuppression 3.1320808 3.1320808
hypoxia 2.8319161 2.8319161
resp_28 2.7232214 2.7232214
xray_unilateral_infiltrate 2.5551626 2.5551626
resp_24 2.4983069 2.4983069
resp_34 2.4088427 2.4088427
ckd 2.4071705 2.4071705
resp_rate.median 2.1295809 2.1295809
cough 2.1136947 2.1136947
xray_pleural_effusion 2.0008194 2.0008194
resp_rate.mean 1.9138636 1.9138636
xray_clear 1.6070100 1.6070100
spo2.skew 1.3402898 1.3402898
resp_32 1.0484862 1.0484862
spo2_85 0.9105275 0.9105275
copd 0.6203962 0.6203962
diastolic.skew 0.6162156 0.6162156
spo2.max 0.5978211 0.5978211
heart_rate.kurtosis 0.5643766 0.5643766
resp_rate.min 0.3921372 0.3921372
hypertension 0.2006672 0.2006672
nausea_vomit 0.0000000 0.0000000

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  138  34
##        Yes  39 138
##                                           
##                Accuracy : 0.7908          
##                  95% CI : (0.7443, 0.8323)
##     No Information Rate : 0.5072          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5817          
##                                           
##  Mcnemar's Test P-Value : 0.6397          
##                                           
##             Sensitivity : 0.7797          
##             Specificity : 0.8023          
##          Pos Pred Value : 0.8023          
##          Neg Pred Value : 0.7797          
##              Prevalence : 0.5072          
##          Detection Rate : 0.3954          
##    Detection Prevalence : 0.4928          
##       Balanced Accuracy : 0.7910          
##                                           
##        'Positive' Class : No              
## 
## The standard error of cross-validation error with SVM is 1.316%

Lasso regression

Many of the features, particularly those we engineered, are correlated with one another. This multicollinearity precludes us from deploying typical generalized linear models, such as logistic regression, particularly because we are concerned with both inference and prediction.

However, a regularized regression model, such as the Lasso, can be used to generate interpretable coefficients. As with the SVM, the first step to interpreting the relative importance of the features is to scale them. Next, I used cross-validation to determine the optimal \(\lambda\) shrinkage parameter. Then I fit the model and assessed performance using training, testing, and cross-validation error. Additionally, I included all interaction terms in the model to attempt to understand how the features influenced one another.
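A minimal sketch of this pipeline (Python/scikit-learn on synthetic data for illustration; the original was presumably fit with glmnet in R, where scikit-learn's C corresponds to the inverse of \(\lambda\)): expand to all pairwise interactions, scale, then fit an L1-penalized logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the feature matrix (10 base features).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# All pairwise interaction terms, scaling, then an L1 (lasso) penalty;
# C = 1/lambda, so a small C means strong shrinkage.
model = make_pipeline(
    PolynomialFeatures(interaction_only=True, include_bias=False),
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_.ravel()
print(f"{(coefs == 0).sum()} of {coefs.size} coefficients shrunk to zero")
# Odds ratios for the surviving (nonzero) coefficients:
odds_ratios = np.exp(coefs[coefs != 0])
```

Exponentiating the surviving coefficients yields the odds ratios reported in the table below; the L1 penalty zeroes out most of the interaction terms.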

While the other models have provided relative feature importance, I had not yet investigated how these features affect the outcome because the tree-based models are non-parametric and I had not evaluated the coefficients of the SVM. Looking at the coefficients, it becomes clear that increased age, heart rate, BMI, and systolic blood pressure are associated with decreased risk of intubation and/or death.

All of the top features associated with an increased risk of intubation or death are interaction terms. Top interaction terms associated with an increased risk of an event are having diabetes and dyspnea, having a pleural effusion on x-ray and respiratory rate above 32 breaths/min, being a smoker and having dyspnea, and being a smoker and being hypoxic.

With the other methods, there were multiple features generated from the same lab value (e.g., heart rate mean and median) among the top important features. With the Lasso regression, this occurred less frequently, as the coefficients of redundant features were shrunk to zero.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  139  46
##        Yes  38 126
##                                           
##                Accuracy : 0.7593          
##                  95% CI : (0.7109, 0.8032)
##     No Information Rate : 0.5072          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5182          
##                                           
##  Mcnemar's Test P-Value : 0.445           
##                                           
##             Sensitivity : 0.7853          
##             Specificity : 0.7326          
##          Pos Pred Value : 0.7514          
##          Neg Pred Value : 0.7683          
##              Prevalence : 0.5072          
##          Detection Rate : 0.3983          
##    Detection Prevalence : 0.5301          
##       Balanced Accuracy : 0.7589          
##                                           
##        'Positive' Class : No              
## 
## The standard error of cross-validation error with a Lasso regression is 1.364%
## 2052 variables out of 2145 had coefficients reduced to zero.
## 93 variables out of 2145 had nonzero coefficients.
Feature                                              Coefficient    Odds ratio
1 age -1.0350460 0.3552100
32 heart_rate.mean -0.7848897 0.4561700
3 bmi -0.7341888 0.4798946
388 diabetesYes:dyspneaChecked 0.4174314 1.5180573
1273 xray_pleural_effusionChecked:resp_32 0.4080340 1.5038582
1114 xray_clearChecked:duration_symptoms -0.3790310 0.6845244
724 cancerYes:xray_bilateral_infiltratesChecked -0.3655983 0.6937814
329 smoke_vapeYes:dyspneaChecked 0.2746908 1.3161237
40 systolic.median -0.2383704 0.7879108
799 any_immunosuppressionYes:systolic.max -0.2222619 0.8007057
255 hypoxiaYes:smoke_vapeYes 0.2088478 1.2322575
795 any_immunosuppressionYes:diastolic.max -0.2056084 0.8141519
1076 dyspneaChecked:diastolic.mean -0.1886905 0.8280428
1218 xray_bilateral_infiltratesChecked:diastolic.max -0.1841598 0.8318029
151 sexMale:duration_symptoms -0.1810089 0.8344279
650 esrdChecked:resp_34 0.1530776 1.1654155
837 feverChecked:heart_rate.mean -0.1511557 0.8597138
385 diabetesYes:diarrheaChecked 0.1407827 1.1511745
607 esrdChecked:cancerYes -0.1395897 0.8697150
914 coughChecked:systolic.skew 0.1285054 1.1371276
41 diastolic.max -0.1252398 0.8822853
1069 dyspneaChecked:duration_symptoms -0.1250613 0.8824428
1284 xray_pleural_effusionChecked:heart_rate.kurtosis -0.1244112 0.8830167
204 bmi:coughChecked -0.1216487 0.8854593
514 copdChecked:systolic.min -0.1154708 0.8909466
547 copdChecked:spo2.kurtosis 0.1124590 1.1190264
462 hypertensionYes:systolic.mean -0.1111172 0.8948339
1152 xray_clearChecked:diastolic.kurtosis 0.1105465 1.1168883
634 esrdChecked:spo2.median -0.1044173 0.9008493
745 cancerYes:resp_rate.max 0.0919102 1.0962664
391 diabetesYes:xray_bilateral_infiltratesChecked 0.0804692 1.0837955
2091 resp_34:resp_rate.skew 0.0803895 1.0837091
482 hypertensionYes:resp_34 -0.0793640 0.9237036
403 diabetesYes:spo2.mean 0.0792707 1.0824973
1117 xray_clearChecked:heart_rate.min -0.0787808 0.9242425
836 feverChecked:diastolic.mean -0.0774835 0.9254423
1337 ed_before_order_setYes:heart_rate.median -0.0765701 0.9262880
275 hypoxiaYes:ed_before_order_setYes -0.0749187 0.9278189
18 myalgiasChecked -0.0684763 0.9338156
769 any_immunosuppressionYes:coughChecked 0.0659893 1.0682152
347 smoke_vapeYes:heart_rate.median -0.0646944 0.9373539
1360 ed_before_order_setYes:heart_rate.skew 0.0617177 1.0636620
1304 duration_symptoms:spo2.max -0.0572311 0.9443758
334 smoke_vapeYes:duration_symptoms -0.0559055 0.9456285
948 diarrheaChecked:spo2.max 0.0527297 1.0541446
613 esrdChecked:myalgiasChecked 0.0524342 1.0538332
163 sexMale:diastolic.median -0.0498722 0.9513510
66 age:sexMale -0.0490329 0.9521498
460 hypertensionYes:resp_rate.mean -0.0490045 0.9521768
31 diastolic.mean -0.0470021 0.9540854
413 diabetesYes:spo2.max 0.0437650 1.0447369
1723 diastolic.median:spo2_93 0.0432961 1.0442471
24 duration_symptoms -0.0410471 0.9597840
2145 systolic.kurtosis:heart_rate.kurtosis 0.0394980 1.0402884
949 diarrheaChecked:systolic.max -0.0394520 0.9613161
1341 ed_before_order_setYes:diastolic.max -0.0370975 0.9635822
78 age:feverChecked -0.0360910 0.9645525
1025 myalgiasChecked:diastolic.min -0.0360690 0.9645738
25 ed_before_order_setYes -0.0342972 0.9662843
458 hypertensionYes:diastolic.mean -0.0336352 0.9669241
137 sexMale:esrdChecked 0.0324941 1.0330278
1416 heart_rate.min:spo2.median 0.0290707 1.0294973
91 age:heart_rate.min -0.0283885 0.9720107
409 diabetesYes:systolic.median -0.0279008 0.9724848
1562 diastolic.mean:resp_rate.max -0.0270074 0.9733540
120 age:resp_rate.skew 0.0251900 1.0255099
396 diabetesYes:heart_rate.min -0.0240975 0.9761905
1872 heart_rate.max:systolic.max -0.0240638 0.9762235
1942 systolic.max:resp_28 -0.0236850 0.9765933
1583 diastolic.mean:systolic.kurtosis -0.0236219 0.9766549
205 bmi:diarrheaChecked -0.0234469 0.9768259
85 age:xray_unilateral_infiltrateChecked -0.0201605 0.9800413
1537 systolic.min:resp_28 0.0187411 1.0189178
1933 spo2.max:spo2.kurtosis 0.0179319 1.0180937
1309 duration_symptoms:spo2_93 0.0161090 1.0162394
1442 heart_rate.min:heart_rate.kurtosis 0.0157948 1.0159202
470 hypertensionYes:resp_rate.max -0.0155562 0.9845641
1086 dyspneaChecked:diastolic.max -0.0153870 0.9847308
1421 heart_rate.min:spo2.max 0.0148290 1.0149395
827 feverChecked:xray_bilateral_infiltratesChecked 0.0139351 1.0140327
397 diabetesYes:resp_rate.min -0.0119229 0.9881479
158 sexMale:diastolic.mean -0.0108361 0.9892224
1535 systolic.min:resp_24 0.0100020 1.0100522
1544 systolic.min:systolic.skew -0.0083029 0.9917315
2086 resp_32:resp_rate.kurtosis 0.0073886 1.0074160
2088 resp_32:spo2.kurtosis 0.0066133 1.0066352
342 smoke_vapeYes:heart_rate.mean -0.0053829 0.9946316
68 age:hypoxiaYes -0.0053626 0.9946518
1972 spo2_85:spo2.kurtosis -0.0051419 0.9948713
2004 spo2_90:heart_rate.skew 0.0043740 1.0043836
449 hypertensionYes:xray_bilateral_infiltratesChecked 0.0038531 1.0038605
2114 diastolic.skew:diastolic.kurtosis -0.0026672 0.9973364
2142 diastolic.kurtosis:heart_rate.kurtosis -0.0014104 0.9985906

Comparison of models

Overall, while the accuracy was higher for the SVM model, the ROC AUC was higher for the tree-based methods. Additionally, the standard error of the cross-validation error from the model training is highest for the Lasso and SVM models, followed by the gradient boosted model, and lowest for the AdaBoost model. Thus, the AdaBoost model still seems to perform the best on the whole and is the model I would recommend for performance. If inference is the goal, the Lasso regression would be the optimal model as it produces interpretable coefficients, as discussed in the Lasso section.

                       Gradient boosted   AdaBoost   SVM     Lasso regression
Accuracy               77.6%              76.4%      79.1%   75.9%
AUC                    0.895              0.895      0.863   0.845
SE, CV training error  1.00%              0.90%      1.32%   1.36%

Appendix

In addition to the analyses described in the body of the report, I also investigated the usage of derived features generated by PCA and a neural network.

PCA

I also investigated using derived features generated by PCA. However, based on the scree plot, the first 17 principal components are needed to explain 80% of the variance (a common “rule of thumb” for the percent of variance that should be explained), so PCA does not enable mapping to a much lower dimensional space, especially compared to the Lasso regression. As the plots of the first 2 and 3 principal components show, the classes are not clearly separable with the first 3 PCs.
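The 80%-of-variance check reduces to a cumulative sum over the explained-variance ratios. As an illustrative sketch (Python/scikit-learn on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix.
X, _ = make_classification(n_samples=300, n_features=30, random_state=0)

# Fit PCA on scaled features and count how many components are needed
# to reach the 80%-of-variance rule of thumb.
pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_needed = int(np.argmax(cumvar >= 0.80)) + 1
print(f"{n_needed} components explain 80% of the variance")
```

If `n_needed` is close to the original feature count, as it was here with 17 components, PCA offers little dimensionality reduction.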

Neural network

I also considered using a neural network to enhance performance. However, it did not appear that the neural network meaningfully improved performance and did not allow for inference, given that it is a true black box model.

I used the AdaBoost model to select the top 20 most important features to include in the neural net model as neural network performance declines if many extraneous features are included.
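This select-then-fit step can be sketched as follows (Python/scikit-learn on synthetic data for illustration; the report's actual neural network architecture is not specified, so the single 16-unit hidden layer here is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the full feature matrix.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)

# Rank features with the boosted model and keep only the top 20,
# then fit a small neural network on the reduced, scaled feature set.
ada = AdaBoostClassifier(random_state=0).fit(X, y)
top20 = np.argsort(ada.feature_importances_)[::-1][:20]
X_top = StandardScaler().fit_transform(X[:, top20])

X_tr, X_te, y_tr, y_te = train_test_split(X_top, y, random_state=0)
nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                   random_state=0).fit(X_tr, y_tr)
print(round(nn.score(X_te, y_te), 3))
```

Pruning to the boosted model's top features keeps the network small and limits the noise it must fit around.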

1 2 3 4 5
0.775000 0.771875 0.784375 0.771875 0.771875
0.775000 0.781250 0.762500 0.768750 0.765625
0.740625 0.771875 0.775000 0.750000 0.731250
0.750000 0.740625 0.731250 0.740625 0.771875
0.728125 0.750000 0.771875 0.778125 0.750000


  1. Hur, K., et al. Factors Associated With Intubation and Prolonged Intubation in Hospitalized Patients With COVID-19. Otolaryngol Head Neck Surg 163, 170-178 (2020).

  2. Clinical management of severe acute respiratory infection (SARI) when COVID-19 disease is suspected. (World Health Organization, 2020).